Building a Reproducible Mining Pipeline for heimskringla.no (MediaWiki) via HTML snapshots + the MediaWiki Action API: Whether the Heimskringla corpus (including the Frostaþingslög) can be searched automatically for an óðal–aþal lexical complex?

Arvid Narimani

doi:10.5281/zenodo.18647092

Description (Technical)

A reproducible, script-driven pipeline is specified for mining the MediaWiki corpus at heimskringla.no for attestations belonging to the curated óðal/aþal lexical complex. The workflow enforces a three-stage separation—(i) URL enumeration, (ii) per-page acquisition, and (iii) extraction plus matching—to isolate coverage decisions from network volatility and to preserve auditability. Corpus-wide coverage is obtained via the MediaWiki Action API using action=query&list=allpages with full continuation handling (apcontinue) and optional redirect exclusion (apfilterredir=nonredirects); a bounded category-harvesting mode is also supported. Each enumerated page is fetched once and persisted as a raw HTML snapshot with accompanying metadata (requested/resolved URL, timestamps, HTTP status, and captured revision identifiers). Text extraction is MediaWiki-aware, preferring the main content container and excluding predictable UI/editorial scaffolding; reference/notes strata can be separated and are excluded by default. Mining is performed against the derived clean-text layer using an invariant philological core (athal_core), while Heimskringla-specific adaptations are confined to span-safe keying normalization to reduce false negatives without rewriting evidential spans. Outputs include an append-only TSV concordance with KWIC context and stable character offsets, per-page text hashes for drift detection, and JSONL run manifests enabling resumable execution and revision-stable replay via captured oldid permalinks.
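
The URL-enumeration stage described above can be illustrated with a minimal sketch. It assumes the Action API is reachable at `https://heimskringla.no/api.php` (the endpoint path is a guess and should be checked against the live site). The names `http_fetch` and `enumerate_allpages` are hypothetical, introduced only for illustration, and are not taken from the author's pipeline. The sketch follows the documented MediaWiki continuation convention: the `continue` object of each response is merged into the next request until no `continue` object is returned.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator

# Assumption: endpoint location is illustrative; verify it before use.
API_URL = "https://heimskringla.no/api.php"


def http_fetch(params: Dict[str, str]) -> dict:
    """Send one GET request to the Action API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(
        url, headers={"User-Agent": "odal-mining-sketch/0.1"}  # hypothetical UA
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def enumerate_allpages(
    fetch: Callable[[Dict[str, str]], dict] = http_fetch,
    namespace: int = 0,
    skip_redirects: bool = True,
) -> Iterator[dict]:
    """Yield every page record from list=allpages, following continuation."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": str(namespace),
        "aplimit": "max",
        "format": "json",
    }
    if skip_redirects:
        # Optional redirect exclusion, as named in the description.
        params["apfilterredir"] = "nonredirects"

    cont: Dict[str, str] = {}
    while True:
        # Merge the previous response's continuation values (apcontinue etc.).
        data = fetch({**params, **cont})
        for page in data.get("query", {}).get("allpages", []):
            yield page
        cont = data.get("continue") or {}
        if not cont:
            break  # no continuation object: enumeration is complete
```

Passing the fetch function as a parameter keeps enumeration logic separate from network access, so the listing can be replayed against recorded responses. A call such as `list(enumerate_allpages())` would contact the live site, so a real run should add rate limiting and write each batch to the run manifest before proceeding.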

Description (Non-technical)

A practical method is presented for searching the Heimskringla website—an online library built on wiki software—for a specific family of Old Norse words related to inherited land and lineage (óðal/aþal). The approach is designed to be repeatable and trustworthy: first, it makes a complete list of the pages to examine; second, it saves an exact copy of each page as it was retrieved; third, it strips away menus, categories, and other website “scaffolding” so that only the real text is searched. The word-search logic is kept stable and unchanged, so results from different runs or different corpora remain comparable. Every finding is recorded with surrounding context and with enough provenance information to trace it back to the exact page version used, even if the website later changes. The end product is a transparent concordance—essentially a searchable evidence table—that supports philological analysis without relying on manual browsing or unreliable site-wide search boxes.
